InforLorV4, Main, Exploration, bibRecord, 008F12

Towards a common framework for linguistic annotation

Internal identifier: 008F12 (Main/Exploration); previous: 008F11; next: 008F13


Authors: Nancy Ide; Laurent Romary

Source :

RBID : CRIN:ide01c

English descriptors

KwdEn :
- annotation schemes, linguistic resources, xml.

Abstract

Corpora are now being annotated for a variety of linguistic features, including not only phenomena such as morpho-syntactic category and syntactic structure, but also discourse structure, co-reference, etc. Typically, annotation schemes, even those representing the same phenomenon, are developed independently at different annotation sites, and as a result, merging and comparing annotated data is difficult or impossible. Similarly, annotations for different phenomena are difficult to combine in a way that enables consideration of relationships and patterns among different linguistic levels. We have been working to develop a framework for linguistic annotation that would solve most if not all of these difficulties. The framework consists of two fundamental pieces: (1) a generalized, abstract model that captures the underlying structure of linguistic annotations; and (2) a means to identify and formally define common ("core") annotation categories, together with mechanisms to map equivalences, refine or modify existing categories, and specify hierarchical relations among categories at different levels of specificity. The aim is to provide an infrastructure for linguistic annotation that enables the commonality needed for reuse and merging, and is at the same time flexible enough to allow for user-specific annotation practices. To this end, we have outlined an annotation framework instantiated via the Extensible Markup Language (XML) and the Resource Description Framework (RDF) and demonstrated its applicability to the representation of lexical information (Ide et al., 2000) and syntactic annotation (Ide and Romary, 2001). In this paper, we outline the principles and mechanisms that support the proposed framework and demonstrate its use and flexibility.
Because it is based on existing and emerging data representation standards and informed by state-of-the-art methods from areas such as database theory, object-oriented design, and knowledge representation, we feel strongly that the framework supports the most advanced and efficient means to exploit annotated corpora. The framework we are proposing would serve as a central repository and service for annotators, providing off-the-shelf formats and categories together with scripts and tools for using and modifying them, creating new categories, and mapping among them. A core feature of the framework is an RDF-based data category registry, whose implementation depends critically on input from the research community. Therefore, our goal is not only to show our results so far, but also to solicit input and feedback from corpus annotators and users that can contribute to further development.
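
The second piece described in the abstract (core categories, equivalence mappings, and hierarchical relations among categories) can be illustrated with a small sketch. This is not the authors' RDF-based registry: it is a toy in-memory version, and every name in it (`Registry`, `schemeA`, `schemeB`, the tags `NN` and `NPR`, the categories `noun` and `properNoun`) is a hypothetical chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Registry:
    """Toy data category registry (hypothetical, not the paper's RDF design)."""

    # category name -> broader category (None for a top-level category)
    parent: Dict[str, Optional[str]] = field(default_factory=dict)
    # scheme name -> {scheme-specific tag: registry category}
    mappings: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def add_category(self, name: str, broader: Optional[str] = None) -> None:
        """Register a category, optionally as a refinement of a broader one."""
        if broader is not None and broader not in self.parent:
            raise KeyError(f"unknown broader category: {broader}")
        self.parent[name] = broader

    def map_tag(self, scheme: str, tag: str, category: str) -> None:
        """Declare that a scheme-specific tag is equivalent to a category."""
        if category not in self.parent:
            raise KeyError(f"unknown category: {category}")
        self.mappings.setdefault(scheme, {})[tag] = category

    def resolve(self, scheme: str, tag: str) -> str:
        """Return the registry category for a scheme-specific tag."""
        return self.mappings[scheme][tag]

    def is_a(self, category: str, ancestor: str) -> bool:
        """True if `category` equals `ancestor` or refines it."""
        current: Optional[str] = category
        while current is not None:
            if current == ancestor:
                return True
            current = self.parent[current]
        return False

    def comparable(self, scheme_a: str, tag_a: str,
                   scheme_b: str, tag_b: str) -> bool:
        """True if two tags map to categories on the same hierarchy path."""
        a = self.resolve(scheme_a, tag_a)
        b = self.resolve(scheme_b, tag_b)
        return self.is_a(a, b) or self.is_a(b, a)


# Two hypothetical annotation schemes mapped onto shared categories.
registry = Registry()
registry.add_category("noun")
registry.add_category("properNoun", broader="noun")
registry.map_tag("schemeA", "NN", "noun")
registry.map_tag("schemeB", "NPR", "properNoun")
```

With these mappings, `registry.comparable("schemeA", "NN", "schemeB", "NPR")` holds because `properNoun` is registered as a refinement of `noun`, which is the kind of cross-scheme comparison the abstract says independently developed schemes make difficult.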

Affiliations:

Links toward previous steps (curation, corpus...)

to stream Crin, to step Corpus: 003012
to stream Crin, to step Curation: 003012
to stream Crin, to step Checkpoint: 001528
to stream Main, to step Merge: 009433
to stream Main, to step Curation: 008F12

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" wicri:score="476">Towards a common framework for linguistic annotation</title>
</titleStmt>
<publicationStmt><idno type="RBID">CRIN:ide01c</idno>
<date when="2001" year="2001">2001</date>
<idno type="wicri:Area/Crin/Corpus">003012</idno>
<idno type="wicri:Area/Crin/Curation">003012</idno>
<idno type="wicri:explorRef" wicri:stream="Crin" wicri:step="Curation">003012</idno>
<idno type="wicri:Area/Crin/Checkpoint">001528</idno>
<idno type="wicri:explorRef" wicri:stream="Crin" wicri:step="Checkpoint">001528</idno>
<idno type="wicri:Area/Main/Merge">009433</idno>
<idno type="wicri:Area/Main/Curation">008F12</idno>
<idno type="wicri:Area/Main/Exploration">008F12</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en">Towards a common framework for linguistic annotation</title>
<author><name sortKey="Ide, Nancy" sort="Ide, Nancy" uniqKey="Ide N" first="Nancy" last="Ide">Nancy Ide</name>
</author>
<author><name sortKey="Romary, Laurent" sort="Romary, Laurent" uniqKey="Romary L" first="Laurent" last="Romary">Laurent Romary</name>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>annotation schemes</term>
<term>linguistic resources</term>
<term>xml</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en" wicri:score="9860">Corpora are now being annotated for a variety of linguistic features, including not only phenomena such as morpho-syntactic category and syntactic structure, but also discourse structure, co-reference, etc. Typically, annotation schemes even those representing the same phenomenon are developed independently at different annotation sites, and as a result, merging and comparison of annotated data is difficult or impossible. Similarly, annotations for different phenomena are difficult to combine to enable consideration of relationships and patterns among different linguistic levels.  We have been working to develop a framework for linguistic annotation that would solve most if not all of these difficulties. The framework consists of two fundamental pieces : (1) a generalized, abstract model that captures the underlying structure of linguistic annotations ; and (2) a means to identify and formally define common («core») annotation categories, together with mechanisms to map equivalences, refine or modify existing categories, and specify hierarchical relations among categories at different levels of specificity. The aim is to provide an infrastructure for linguistic annotation that enables the commonality needed for reuse and merging, and is at the same time flexible enough to allow for user-specific annotation practices. To this end, we have outlined an annotation framework instantiated via the Extended Markup Language (XML) and the Resource Definition Framework (RDF) and demonstrated its applicability to the representation of lexical information (Ide, et al., 2000) and syntactic annotation (Ide and Romary, 2001).  In this paper, we outline the principles and mechanisms that support the proposed framework and demonstrate its use and flexibility. 
Because it is based in existing and emerging data representation standards and informed by state-of-the-art methods from areas such as database theory, object-oriented design, knowledge representation, etc., we feel strongly that the framework supports the most advanced and efficient means to exploit annotated corpora. The framework we are proposing would serve as a central repository and service for annotators, providing off-the-shelf formats and categories together with scripts and tools for using and modifying them, creating new categories, and mapping among them. A core feature of the framework is an RDF-based data category registry, whose implementation depends critically on input from the research community. Therefore, our goal is not only to show our results so far, but also to solicit input and feedback from corpus annotators and users that can contribute to further development.</div>
</front>
</TEI>
<affiliations><list></list>
<tree><noCountry><name sortKey="Ide, Nancy" sort="Ide, Nancy" uniqKey="Ide N" first="Nancy" last="Ide">Nancy Ide</name>
<name sortKey="Romary, Laurent" sort="Romary, Laurent" uniqKey="Romary L" first="Laurent" last="Romary">Laurent Romary</name>
</noCountry>
</tree>
</affiliations>
</record>
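
As a minimal sketch of how such a record might be read outside Dilib, the Python below extracts the title, authors, and keywords from a shortened copy of the record using only the standard library. Two points are assumptions rather than facts about the server: the record as shown uses the `wicri:` prefix without declaring it, so the sketch wraps it in a hypothetical `<wrapper>` element that binds the prefix to a placeholder URI (`urn:example:wicri`), and the function name `parse_record` is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI: the page does not say which URI "wicri" maps to.
WICRI_NS = "urn:example:wicri"

# Shortened copy of the record above (identifiers and abstract omitted).
RECORD = """<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" wicri:score="476">Towards a common framework for linguistic annotation</title>
</titleStmt>
<sourceDesc><biblStruct><analytic>
<author><name sortKey="Ide, Nancy" first="Nancy" last="Ide">Nancy Ide</name></author>
<author><name sortKey="Romary, Laurent" first="Laurent" last="Romary">Laurent Romary</name></author>
</analytic></biblStruct></sourceDesc>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>annotation schemes</term>
<term>linguistic resources</term>
<term>xml</term>
</keywords></textClass></profileDesc>
</teiHeader></TEI></record>"""


def parse_record(record_xml: str) -> dict:
    """Return title, authors, and keywords from one bibliographic record."""
    # Bind the undeclared "wicri" prefix on a wrapper element; without this,
    # the parser rejects the attribute wicri:score as an unbound prefix.
    wrapped = f'<wrapper xmlns:wicri="{WICRI_NS}">{record_xml}</wrapper>'
    root = ET.fromstring(wrapped)
    record = root.find("record")
    return {
        "title": record.findtext(".//titleStmt/title"),
        "authors": [n.text for n in record.findall(".//analytic/author/name")],
        "keywords": [t.text for t in record.findall(".//keywords/term")],
    }


info = parse_record(RECORD)
```

Wrapping the record, rather than editing it, leaves the stored XML untouched, so the same function should also accept the full record text as displayed above.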

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 008F12 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 008F12 | SxmlIndent | more

To add a link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Lorraine
   |area=    InforLorV4
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     CRIN:ide01c
   |texte=   Towards a common framework for linguistic annotation
}}

This area was generated with Dilib version V0.6.33.
Data generation: Mon Jun 10 21:56:28 2019. Site generation: Fri Feb 25 15:29:27 2022

	Exploration server for computer science research in Lorraine
	Warning: this site is under development. It was generated by automated means from raw corpora, so the information has not been validated.
